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Abstract — We refine and extend an earlier MDL denoising 
criterion for wavelet-based denoising. We start by showing that 
the denoising problem can be reformulated as a clustering 
problem, where the goal is to obtain separate clusters for 
informative and non-informative wavelet coefficients, respectively. 
This suggests two refinements, adding a code-length for the 
model index, and extending the model in order to account for 
subband-dependent coefficient distributions. A third refinement 
is derivation of soft thresholding inspired by predictive universal 
coding with weighted mixtures. We propose a practical method 
incorporating all three refinements, which is shown to achieve 
good performance and robustness in denoising both artificial and 
natural signals. 

Index Terms — Minimum description length (MDL) principle, 
wavelets, denoising. 

I. Introduction 

WAVELETS are widely applied in many areas of signal 
processing [1], where their popularity owes largely to 
efficient algorithms on the one hand and advantages of sparse 
wavelet representations on the other The sparseness property 
means that while the distribution of the original signal values 
may be very diffuse, the distribution of the corresponding 
wavelet coefficients is often highly concentrated, having a 
small number of very large values and a large majority of 
very small values [2]. It is easy to appreciate the importance of 
sparseness in signal compression, [3], [4]. The task of remov- 
ing noise from signals, or denoising, has an intimate link to 
data compression, and many denoising methods are explicitly 
designed to take advantage of sparseness and compressibility 
i in the wavelet domain, see e.g., [5]-[7]. 

Among the various wavelet-based denoising methods those 
suggested by Donoho and Johnstone [8], [9] are the best 
known. They follow the frequentist minimax approach, where 
the objective is to asymptotically minimize the worst-case 
risk simultaneously for signals, for instance, in the entire scale 
of Holder, Sobolev, or Besov classes, characterized by certain 
smoothness conditions. By contrast, Bayesian denoising meth- 
ods minimize the expected (Bayes) risk, where the expectation 
is taken over a given prior distribution supposed to govern the 
unknown true signal [10], [11]. Appropriate prior models with 
very good performance in typical benchmark tests, especially 
for images, include the class of generalized Gaussian densities 
[6], [12], [13], and scale-mixtures of Gaussians [14], [15] 
(both of which include the Gaussian and double exponential 
densities as special cases). 
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A third approach to denoising is based on the minimum 
description length (MDL) principle [16]-[19]. Several differ- 
ent MDL denoising methods have been suggested [6], [12], 
[20]-[24]. We focus on what we consider as the most pure 
MDL approach, namely that of Rissanen [23]. Our motivation 
is two-fold: First, as an immediate result of refining and 
extending the earlier MDL denoising method, we obtain a 
new practical method with greatly improved performance 
and robustness. Secondly, the denoising problem turns out 
to illustrate theoretical issues related to the MDL principle, 
involving the problem of unbounded parametric complexity 
and the necessity of encoding the model class. The study of 
denoising gives new insight to these issues. 

Formally, the denoising problem is the following. Let j/" — 
{Ui, . . . , yn)^ be a signal represented by a real- valued column 
vector of length n. The signal can be, for instance, a time- 
series or an image with its pixels read in a row-by-row order. 
Let W be an n X m regressor matrix whose columns are 
basis vectors. We model the signal y" as a linear combination 
of the basis vectors, weighted by coefficient vector /3" = 
(/?!, . . . , Pm)^ , plus Gaussian i.i.d. noise: 

y«=>V/3"+e", e/-^'-AA(0,a^), (1) 

where ct^ is the noise variance. Given an observed signal 
y", the ideal is to obtain a coefficient vector /3™ such that 
the signal given by the transform y" = W/3™ contains the 
informative part of the observed signal, and the difference y" — 
is noise. 

For technical convenience, we adopt the common restriction 
on W that the basis vectors span a complete orthonormalhasis. 
This implies that the number of basis vectors is equal to the 
length of the signal, m — n, and that all the basis vectors 
are orthogonal unit vectors. There are a number of wavelet 
transforms that conform to this restriction, for instance, the 
Haar transform and the family of Daubechies transforms [1], 
[25]. Formally, the matrix W is of size n x n and orthogonal 
with its inverse equal to its transpose. Also the mapping /3" 
yy/3" preserves the Euclidean norm, and we have Parseval's 
equality: 

ll/3"ll = = v/(W/3",W/3") = l|W/3"||. (2) 

Geometrically this means that the mapping /3" ^ W/?" is a 
rotation and/or a reflection. From a statistical point of view, 
this implies that any spherically symmetric density, such as 
Gaussian, is invariant under this mapping. All these properties 
are shared by the mapping i-^ W'^y". We call /3" i-^ W/3" 
the inverse wavelet transform, and i—s- W^y^ the forward 
wavelet transform. Note that in practice the transforms are not 
implemented as matrix multiplications but by a fast wavelet 
transform similar to the fast Fourier transform (see [1]), and 
in fact not even the matrices need be written down. 
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For complete bases, the conventional maximum likelihood 
(least squares) method obviously fails to provide denoising un- 
less the coefficients are somehow restricted since the solution 
/?" = W^y'^ gives the reconstruction y" = WW^y'^ = y" 
equal to the original signal, including noise. The solution 
proposed by Rissanen [23] is to consider each subset of the 
basis vectors separately and to choose the subset that allows 
the shortest description of the data at hand. The length of 
the description is determined by the normalized maximum 
likelihood (NML) code length. 

The NML model involves an integral, which is undefined 
unless the range of integration (the support) is restricted. This, 
in turn, implies hyper parameters, which have received in- 
creasing attention in various contexts involving, e.g., Gaussian, 
Poisson and geometric models [17], [19], [26]-[29]. Rissanen 
used renormalization to remove them and to obtain a second- 
level NML model. Although the range of integration has 
to be restricted also in the second-level NML model, the 
range for ordinary regression problems does not affect the 
resulting criterion and can be ignored. Roos et al. [30] give an 
interpretation of the method which avoids the renormalization 
procedure and at the same time gives a simplified view of the 
denoising process in terms of two Gaussian distributions fitted 
to informative and non-informative coefficients, respectively. 
In this paper we carry this interpretation further and show 
that viewing the denoising problem as a clustering problem 
suggests several refinements and extensions to the original 
method. 

The rest of this paper is organized as follows. In SecHllwe 
reformulate the denoising problem as a task of clustering the 
wavelet coefficients in two or more sets with different distri- 
butions. In SecHnlwe propose three different modifications of 
Rissanen's method, suggested by the clustering interpretation. 
In Sec. lIVI the modifications are shown to significantly improve 
the performance of the method in denoising both artificial and 
natural signals. The conclusions are summarized in Sec. |Vl 

II. Denoising and Clustering 
A. Extended Model 

We rederive the basic model ([Q in such a way that there 
is no need for renormalization. This is achieved by inclusion 
of the coefficient vector (3 in the model as a variable and by 
selection of a (prior) density for fi. While the resulting NML 
model will be equivalent to Rissanen's renormalized solution, 
the new formulation is easier to interpret and directly suggests 
several refinements and extensions. 

Consider a fixed subset 7 C {1, . . . ,n} of the coefficient 
indices. We model the coefficients j3i for i e 7 as independent 
outcomes from a Gaussian distribution with variance r^. In the 
basic hard threshold version all Pi for i ^ 7 are forced to equal 
zero. Thus the extended model is given by 



y- = >V/3" + e", {(3rJ-M{Q,T^), if* £7, (3) 
^(3i = 0, otherwise. 

This way of modeling the coefficients is akin to the so 
called spike and slab model often used in Bayesian vari- 



able selection [31], [32] and applications to wavelet-based 
denoising [33], [34] (and references therein). In relation to the 
sparseness property mentioned in the introduction, the 'spike' 
consists of coefficients with 1^7 that are equal to zero, while 
the 'slab' consists of coefficients with 167 described by a 
Gaussian density with mean zero. This is a simple form of a 
scale-mixture of Gaussians with two components. In Sec. lIII-Bl 
we will consider a model with more than two components. 

Let c" = /3" + >V"^e", where W-^e" gives the representation 
of the noise in the wavelet domain. The vector is the 
wavelet representation of the signal y", and we have 

It is easy to see that the maximum likelihood parameters are 
obtained directly from 



Ci, if i e 7, 

0, otherwise. 



(4) 



The i.i.d. Gaussian distribution for e" in (|3j implies that the 
distribution of W^e"' is also i.i.d. and Gaussian with the 
same variance, aj^. As a sum of two independent random 
variates, each Ci has a distribution given by the convolution of 
the densities of the summands, and the ith component of 
>V-^e". In the case 1^7 this is simply A/'(0, cr^). In the case 
i G 7 the density of the sum is also Gaussian, with variance 
given by the sum of the variances, + aj^. All told, we have 
the following simplified representation of the extended model 
where the parameters /?" are implicit: 



= >Vc" 



i.i.d. 



[AA(0, a|), 



if i e 7, 
otherwise. 



(5) 



where ct| := 



ajq denotes the variance of the informative 
coefficients, and we have the important restriction aj > aj^ 
which we will discuss more below. 



B. Denoising Criterion 

The task of choosing a subset 7 can now be seen as a 
clustering problem: each wavelet coefficient belongs either 
to the set of the informative coefficients with variance aj, 
or the set of non-informative coefficients with variance aj^. 
The MDL principle gives a natural clustering criterion by 
minimization of the code-length achieved for the observed 
signal (see [35]). Once the optimal subset is identified, the 
denoised signal is obtained by setting the wavelet coefficients 
to their maximum likelihood values (0}; i.e., retaining the 
coefficients in 7 and discarding the rest, and doing the inverse 
transformation. It is well known that this amounts to an 
orthogonal projection of the signal to the subspace spanned 
by the wavelet basis vectors in 7. 

The code length under the model (|5j depends on the values 
of the two parameters, and ct^. The standard solution in 
such a case is to construct a single representative model for 
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the whole model class' such that the representative model is 
universal (can mimic any of the densities in the represented 
model class). The minimax optimal universal model (see [36]) 
is given by the so called normaUzed maximum likelihood 
(NML) model, originally proposed by Shtarkov [37] for data 
compression. We now consider the NML model corresponding 
to the extended model (|5} with the index set 7 fixed. 

Denote by fc = fc(7) the number of coefficients for which 
i G 7. The NML density under the extended model Q for a 
given coefficient subset 7 is defined as 



/«m;(y" ; 7) 



where aj — irKy") and o-^ = d-j^{y") are the maximum 
likelihood parameters for the data y", and is the important 
normalizing constant. The constant C\ is also known as the 
parametric complexity of the model class defined by 7. 

Restricting the data such that the maximum likelihood 
parameters satisfy 



< 



-min — "^Ni ^ ^ 



2 

max ' 



and ignoring the constraint aj^ < erf, the code length under crj is exactly one bit. 
the extended model Q is approximated by^ 



noise it gives better results than a universal threshold of 
Donoho and Johnstone [8] (VisuShrink), over-fitting occurs in 
noisy cases [30] (see also Sec. II VI below), which is explained 
by the fact that omission of the third term is justified only in 
regression problems with few parameters. 

Remark 2: It was proved in [23] that the criterion (|6j is 
minimized by a subset 7 which consists of some number k of 
the largest or smallest wavelet coefficients in absolute value. It 
was also felt that in denoising applications the data are such 
that the largest coefficients will minimize the criterion. The 
above alternative formulation gives a natural solution to this 
question: by the inequality <t| > cr^, the set of coefficients 
with larger variance, i.e., the one with larger absolute values 
should be retained, rather than vice versa. 

Remark 3: In reality the NML model corresponding to the 
extended model Q is identical to Rissanen's renormalized 
model only if the inequality aj > aj^ is ignored in the calcu- 
lations (see the appendix). However, the following proposition 
(proved in the appendix) shows that the effect of doing so is 
independent of k, and hence irrelevant. 

Proposition 1: The effect of ignoring the constraint cr^ < 



■In 



2^- 



^ - + -\nk{n-k), 



(6) 



plus a constant independent of 7, with S{y") and S^{y") 
denoting the sum of the squares of all the wavelet coefficients 
and the coefficients for which 1^7, respectively (see the 
appendix for a proof). The code length formula is very 
accurate even for small n since it involves only the Stirling 
approximation of the Gamma function. 

Remark 1: The set of sequences satisfying the restriction 
c^min — ' '^i — ^max depends on 7. For instance, consider 
the case n = 2. In a model with fc = 1, the restriction corre- 
sponds to a union of four squares, whereas in a model with 
either fc = or fc = 2, the relevant area is an annulus (two- 
dimensional spherical shell). However, the restriction can be 
understood as a definition of the support of the corresponding 
NML model, not a rigid restriction on the data, and hence 
models with varying 7 are still comparable as long as the 
maximum likelihood parameters for the observed sequence 
satisfy the restriction. 

The code length obtained is identical to that derived by 
Rissanen with renormalization [23] (note the correction to the 
third term of (|6j in [38]). The formula has a concise and 
suggestive form that originally lead to the interpretation in 
terms of two Gaussian densities [30]. It is also the form that 
has been used in subsequent experimental work with somewhat 
mixed conclusions [30], [39]: While for Gaussian low variance 

'Here the usual terminology where the word 'model' has double meaning 
is somewhat unfortunate. The term refers to both a set of densities such as the 
one defined by Eq. JsJ (as in the 'Gaussian model', or the 'logistic model'), 
and a single density such as the NML model, which can of course be thought 
of as a singleton set. Whenever there is a chance of confusion, we use the 
term 'model class' in the first sense. 

^We express code lengths in nats which corresponds to the use of the natural 
logarithm. One nat is equal to (In 2)^^ bits. 



We can safely ignore the constraint and use the model with- 
out the constraint as a starting point for further developments 
for the sake of mathematical convenience. 



III. Refined MDL Denoising 

A. Encoding the Model Class 

It is customary to ignore encoding of the index of the model 
class in MDL model selection; i.e., encoding the number of 
parameters when the class is in one-to-one correspondence 
with the number of parameters. One simply picks the class that 
enables the shortest description of the data without considering 
the number of bits needed to encode the class itself. Note that 
here we do not refer to encoding the parameter values as in 
two-part codes, which are done implicitly in the so-called 'one- 
part codes' such as the NML and mixture codes. In most cases 
there are not too many classes and hence omitting the code 
length of the model index has no practical consequence. When 
the number of model classes is large, however, this issue does 
become of importance. In the case of denoising, the number 
of different model classes is as large as 2" (with n as large 
as 512 X 512 = 262, 144) and, as we show, encoding of the 
class index is crucial. 

The encoding method we adopt for the class index is simple. 
We first encode fc, the number of retained coefficients with a 
uniform code, which is possible since the maximal number 
n is fixed. This part of the code can be ignored since it 
only adds a constant to all code lengths. Secondly, for each 
fc there are a number of different model classes depending 
on which fc coefficients are retained. Note that while the 
retained coefficients are always the largest fc coefficients, this 
information is not available to the decoder at this point and 
the index set to be retained has to be encoded. There are (^) 
sets of size fc, and we use a uniform code yielding a code 
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length In (") nats, corresponding to a prior probability 



7r(7) = 



k\{n-ky. 



(7) 



Applying Stirling's approximation to the factorials and 
ignoring all constants wrt. 7 gives the final code length formula 



kf 



2 fc3 ■ 



(8) 



The proof can be found in the appendix. 

This way of encoding the class index is by no means 
the only possibility but it will be seen to work sufficiently 
well, except for one curious limitation: As a consequence 
of modeling both the informative coefficients and the noise 
by densities from the same Gaussian model, the code length 
formula approaches the same value as k approaches either zero 
or n, which actually are disallowed. Hence, it may be that in 
cases where there is little information to recover, the random 
fluctuations in the data may yield a minimizing solution near 
k = n instead of a correct solution near fc = 0. A similar 
phenomenon has been demonstrated for "saturated" Bernoulli 
models with one parameter for each observation [27], and 
resembles the inconsistency problem of BIC in Markov chain 
order selection [40]: In all these cases pure random noise is 
incorrectly identified as maximally regular data. In order to 
prevent this we simply restrict k < .95n, which seems to 
avoid such problems. A general explanation and solution for 
these phenomena would be of interest^. 

B. Subband Adaptation 

It is an empirical fact that for most natural signals the 
coefficients on different subbands corresponding to different 
frequencies (and orientations in 2D data) have different char- 
acteristics. Basically, the finer the level, the more sparse the 
distribution of the coefficients, see Fig.^ (This is not the case 
for pure Gaussian noise or, more interestingly, signals with 
fractal structure [2].) Within the levels the histograms of the 
subbands for different orientations of 2D transforms typically 
differ somewhat, but the differences between orientations are 
not as significant as between levels. 

In order to take the subband structure of wavelet transforms 
into account, we let each subband 5 G {!,..., B} have its own 
variance, r^. We choose the set of the retained coefficients 
separately on each subband, and let 7^ denote the set of 
the retained coefficients on subband b, with ki, :— |7h|. For 
convenience, let 70 be the set of all the coefficients that are 
not retained. Note that this way we have ko + . . . + ki, = n. 
In order to encode the retained and the discarded coefficients 
on each subband, we use a similar code as in the 'flat' case 
(Sec. IIII-A> . For each subband 1, .... B, the number of nats 
needed is In {^''J- 

'Perhaps a solution could be found in algorithmic information theory 
(Kolmogorov complexity) and the concept of Kolmogorov minimal sufficient 
statistic [41] which is the simplest one of many equally efficient descriptions. 
However, for practical purposes, a modification of the concept is needed in 
order to account for the fluctuations near the extremes, which are succumbed 
by the constant 0(1) terms in algorithmic information theory. 




Fig. 1 . Log-scale representation of the empirical histograms of the wavelet 
coelficients on dyadic levels 6-9 for the Boat image (see Sec. IIVI below). 
Finer levels have narrower (more sparse) distributions than coarser levels; the 
finest level (9) is drawn with solid line. 



Ignoring again the constraint + cr 



'N 



> (rjf, the levels 



can be treated as separate sets of coefficients with their own 
Gaussian densities just as in the previous subsection, where 
we had two such sets. The code length function, including the 
code length for 7, becomes after Stirling's approximation to 
the Gamma function and ignoring constants as follows: 



B 



— In 

'^\2 k 

b=o 



(y") 



— In kh 
2 



El- 

6=1 



(9) 



The proof is omitted since it is entirely analogous to the 
proof of Eq. (|6|l (see the appendix), the only difference being 
that now we have B + 1 Gaussian densities instead of only 
two. Notwithstanding the added code-length for the retained 
indices, for the case B = 1 this coincides with the original 
setting, where the subband structure is ignored, Eq. (|6j, since 
we then have ko = n — ki. This code can be extended to allow 
/cfc = for some subbands simply by ignoring such subbands, 
which formally corresponds to reducing B in such cases'*. 

Finding the index sets 7b that minimize the NML code 
length simultaneously for all subbands b is computationally 
demanding. While on each subband the best choice always 
includes some ki, largest coefficients, the optimal choice on 
subband b depends on the choices made on the B — 1 other 
subbands. A reasonable approximate solution to the search 
problem is obtained by iteration through the subbands and, 
on each iteration, finding the locally optimal coefficient set 
on each subband, given the current solution on the other 
subbands. Since the total code length achieved by the current 
solution never increases, the algorithm eventually converges, 
typically after not more than five iterations. Algorithm 1 in 
Fig. 12] implements the above described method. Following 

'^In fact, when reducing B the constants ignored also get reduced. This 
effect is very small compared to terms in |9), and can be safely ignored since 
codes with positive constants added to the code lengths are always decodable. 
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Algorithm 1. 
Input: signal y" 



B} 



0. set c" ^ W^y" 

1. initialize kb — rib for all & G {1, . . . , 

2. do until convergence 

3. for each 6 e {Bo + 1, . . . ,B} 

4. optimize wrt. criterion (|9} 

5. end 

6. end 

7. for each z G {1, . . . , n} 

8. if i ^ 7 then set c„ ^ 

9. end 

10. output Wc" 



Fig. 2. Outline of an algorithm for subband-adaptive MDL denoising. The 
coarsest Bo subbands are not processed in the loop of Steps 3-5. In Step 8, 
the tinal model 7 is defined by the largest fcj, coefficients o n each subband b. 
A soft thresholding variation to Step 8 is described in Sec. IIII-CI 



established practice [9], [12], all coefficients are retained on 
the smallest (coarsest) subbands^. 

C. Soft Thresholding by Mixtures 

The methods described above can be used to determine the 
MDL model, defined by a subset 7 of the wavelet coefficients, 
that gives the shortest description to the observed data. How- 
ever, in many cases there are several models that achieve nearly 
as good a compression as the best one. Intuitively, it seems 
then too strict to choose the single best model and discard 
all the others. A modification of the procedure is to consider 
a mixture, where all models indexed by 7 are weighted by 
Eq. Q: 

/«,u-(2/") := fnmiiy" ; 7) 7^(7)- 

7 

Such a mixture model is universal (see e.g. [18], [19]) in 
the sense that with increasing sample size the per sample 



average of the code length 



ln/mu (2/") approaches that 



of the best 7 for all y". Consequently, predictions obtained by 
conditioning on past observations converge to the optimal ones 
achievable with the chosen model class. A similar approach 
with mixtures of trees has been applied in the context of 
compression [42]. 

For denoising purposes we need a slightly different setting 
since we cannot let n grow. Instead, given an observed 
signal y", consider another image z" from the same source. 
Denoising is now equivalent to predicting the mean value of 
z". Obtaining predictions for z" given from the mixture 
is in principle easy: one only needs to evaluate a conditional 
mixture 



/,«/.(^" 1 2/") 



^ ^ fnml{^ 



7 



'We retain all subbands below level 4, i.e., all subbands with 16 or less 
coefficients. This has little effect to the present method, but since it is 
important for other methods to which we compare, especially SureShrink, 
we adopted the practice in order to facilitate comparison. 



with new updated 'posterior' weights for the models, obtained 
by multiplying the NML density by the prior weights and 
normalizing wrt. 7: 

/«mz(2/" ; 7)7r(7) 



7r(7 I y") 



E7' ; 



(10) 



Since in the denoising problem we only need the mean value 
instead of a full predictive distribution for the coefficients, 
we can obtain the predicted mean as a weighted average of 
the predicted means corresponding to each 7 by replacing the 
density /,„„/(2;" | y" ; 7) by the coefficient value Ci — Ci{y^'-) 
obtained from y" for 1^7 and zero otherwise, which gives 
the denoised coefficients 



ihe^7r{j\yn^c,Tnij\y^) 



7 



79 j 



(11) 



where the indicator function I^g^ takes value one if z e 7 and 
zero otherwise. Thus the mixture prediction of the coefficient 
value is simply Ci times the sum of the weights of the models 
where i G 7 with the weights given by Eq. dlOt . 

The practical problem that arises in such a mixture model is 
that summing over all the 2" models is intractable. Since this 
sum appears as the denominator of dlOt . we cannot evaluate 
the required weights. We now derive a tractable approximation. 
To this end, let 71 . . . 7„ denote a model determined by i G 7 
iff 7i = 1, and let 71 . . . 1^ . . . 7„ denote a particular one with 
7i = 1. Also, let 7 = 71 . . .7„ be the model with maximal 
NML posterior weight (I10> . The weight with which each 
individual coefficient contributes to the mixture prediction can 
be obtained from 



y" 



E7^i'^(7ly") 1-^79^7^(7 12/") 



■i-f 



^Ai\yn = Y 

~l3i 



(12) 



Note that the ratio 7-; is equal to 
V 7r(7i ...I, 



■7« |2/") 



V , 7r(7j ...Oj...7; I y-^ 



This can be approximated by 
X;.^7r(7i...l, ■..7„ I y") ^ 



71" (71 



■7n I2/") 



■I'n 



2/ 



7r(7i 



.0,; 



• In I y" 



which means that the exponential sums in the numerator and 
the denominator are replaced by their largest terms assuming 
that forcing 7^ to be one or zero has no effect on the other 
components of 7. The ratio of two weights can be evaluated 
without knowing their common denominator, and hence this 
gives an efficient recipe for approximating the weights needed 
in Eq. ([TT}- 

Intuitively, if fixing 7^ = decreases the posterior weight 
significantly compared to 7,; = 1, the approximated value of 
ri becomes large and the i'th coefficient is retained near its 
maximum likelihood value c^. Conversely, coefficients that 
increase the code length when included in the model are 
shrunk towards zero. Thus, the mixing procedure implements a 
general form of 'soft' thresholding, of which a restricted piece- 
wise linear form has been found in many cases superior to hard 
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Fig. 3. The behavior of the soft thresholding method implemented by 
Algorithm 2 for one of the subbands of the Boat image with no added noise 
(see Sec.|Iv|: the original wavelet coefficient value Ci on the x-axis, and the 
thresholded value Cifi/{1 + fi) on the y-axis. For coefficients with large 
absolute value, the curve approaches the diagonal (dotted line). The general 
shape of the curve is always the same but the scale depends on the data: the 
more noise, the wider the non-linear part. 

thresholding in earlier work [8], [12]. Such soft thresholding 
rules have been justified in earlier works by their improved 
theoretical and empirical properties, while here they arise 
naturally from a universal mixture code. The whole procedure 
for mixing different coefficient subsets can be implemented by 
replacing Step 8 of Algorithm 1 in Fig.|2lby the instruction 

Set Co ^ Co 

l + n 

where fi denotes the approximated value of r,. The behavior 
of the resulting soft threshold is illustrated in Fig. |3] 

IV. Experimental Results 
A. Data and Setting 

The effect of the three refinements of the MDL denoising 
method was assessed separately and together on a set of arti- 
ficial ID signals [9] and natural images^ commonly used for 
benchmarking. The signals were contaminated with Gaussian 
pseudo-random noise of known variance a^, and the denoised 
signal was compared with the original signal. The Daubechies 
D6 wavelet basis was used in all experiments, both in the ID 
and 2D cases. The error was measured by the peak-signal-to- 
noise ratio (PSNR), defined as 

P5A.^:=10.1og,„(^), 

where Range is the difference between the maximum and 
minimum values of the signal (for images Range = 255); and 
MSE is the mean squared error. The experiment was repeated 
15 times for each value of a^, and the mean value and standard 
deviation was recorded. 

'The images were the same as used in many earlier papers, available at 
fhttp : / / decsai ■ ugr . es/ ~ javier/denoise/| 



The compared denoising methods were the original MDL 
method [23] without modifications; MDL with the modifica- 
tion of Sec, nil- AI MDL with the modifications of Secs. lIII-Al 
and IIII-Bl and MDL with the modifications of Sees. IIII-AI 
IIII-BI and IIII-CI For comparison, we also give results for 
three general denoising methods applicable to both ID and 
2D signals, namely VisuShrink [8], SureShrink [9], and 
BayesShrink [12]^ 

B. Results 

Figure 0] illustrates the denoising results for the Blocks 
signal [9] with signal length n — 2048. The original signal, 
shown in the top-left display, is piece-wise constant. The 
standard deviation of the noise is cr = 0.5. The best method, 
having the highest PSNR (and equivalently, the smallest MSE) 
is the MDL method with all the modifications proposed in the 
present work, labeled MDL (A-B-C) in the figure. Another 
case, the Peppers image with noise standard deviation cr ~ 30, 
is shown in Fig. |5] where the best method is BayesShrink. 
Visually, SureShrink and BayesShrink give a similar result 
with some remainder noise left, while MDL (A-B-C) has 
removed almost all noise but suffers from some blurring. 

The relative performance of the methods depends strongly 
on the noise level. Figure |6l illustrates this dependency in 
terms of the relative PSNR compared to the MDL (A-B-C) 
method. It can be seen that the MDL (A-B-C) is uniformly 
the best among the four MDL methods except for a range 
of small noise levels in the Peppers case, where the original 
method [23] is slightly better. Moreover, it can be seen that 
the modifications of Sees. IIII-51 and UlLCl improve the perfor- 
mance on all noise levels for both signals. The right panels 
of Fig. |6l show that the overall best method is BayesShrink, 
except for small noise levels in Blocks, where the MDL (A- 
B-C) method is the best. This is explained by the fact that the 
generalized Gaussian model used in BayesShrink is especially 
apt for natural images but less so for ID signals of the kind 
used in the experiments. 

The above observations generalize to other ID signals and 
images as well, as shown by Tables HI and HH For some ID 
signals (Heavisine, Doppler) the SureShrink method is best 
for some noise levels. In images, BayesShrink is consistently 
superior for low noise cases, although it can be debated 
whether the test setting where the denoised image is compared 
to the original image, which in itself akeady contains some 
noise, gives meaningful results in the low noise regime. For 
moderate to high noise levels, BayesShrink, MDL (A-B-C) 
and SureShrink typically give similar PSNR output. 

V. Conclusions 

We have revisited an earlier MDL method for wavelet- 
based denoising for signals with additive Gaussian white 
noise. In doing so we gave an alternative interpretation of 

'AU the compared methods are available as a free package, downloadable at 
http : / / www .cs.helsinki. f i / teemu . roos/de noise/ The pack- 
age includes the source code in C, using wavelet tra nsfo rms from the Gnu 
Scientific Library (GSL). All the experiments of Sec. llVl can be reproduced 
using the package. 
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Fig. 4. Simulation Results. Panels from top to bottom, left to right: Blocks signal [9], sample size n = 2048; noisy signal, noise standard deviation cr = 0.5, 
PSNR=23.2; original MDL method [23], PS NR=28.5; MD L with m odification of Sec. lIII-AI PSNR=29.0; MDL with modifications of Sees. IIII- Al and HITbI 
PSNR=29.6; MDL wifii modifications of Secs. lIirAllnrBl aiid lnrcl PSNR=30.1; VisuShrink [8], PSNR=28.6; SureShiink [9], PSNR=28.9; BayesShrink [12], 
PSNR=29.8. (Higher PSNR is better). 



Rissanen's renormalization technique for avoiding the problem 
of unbounded parametric complexity in normalized maximum 
likelihood (NML) codes. This new interpretation suggested 
three refinements to the basic MDL method which were shown 
to significantly improve empirical performance. 

The most significant contributions are: i) an approach in- 
volving what we called the extended model, to the problem 
of unbounded parametric complexity which may be useful 
not only in the Gaussian model but, for instance, in the 
Poisson and geometric families of distributions with suitable 
prior densities for the parameters; ii) a demonstration of the 
importance of encoding the model index when the number 
of potential models is large; iii) a combination of universal 
models of the mixture and NML types, and a related predictive 
technique which should also be useful in MDL denoising 
methods (e.g. [20], [21], [24]) that are based on finding a 
single best model, and other predictive tasks. 

Appendix I 
Postponed Proofs 

Proof of Eq. The proof of Eq. (|6} is technically similar 
to the derivation of the renormalized NML model in [23], 
which goes back to [43]. First note that due to orthonormality. 



the density of under the extended model is always equal 
to the density of c" evaluated at W^y". Thus, for instance, 
the maximum likelihood parameters for data y" are easily 
obtained by maximizing the density of c" at W^y". The 
density of c" is given by 

/(c" ; alal) ^\{c^{c, ; 0,^?) J]0(c, ; O.a^,), (13) 

where 0(- ; /i, a^) denotes a Gaussian density function with 
mean /z and variance . 

Let (y") be the sum of squares of the wavelet coefficients 
with i G 7: 

and let S{y^) denote the sum of all wavelet coefficients. With 
slight abuse of notation, we also denote these two by 6*^(0") 
and S{c"), respectively. Let k be the size of the set 7. 
The likelihood is maximized by parameters given by 

.2 s^jyn ^2 ^(?/")-^7(y") 

With the maximum likelihood parameters ( I14> the likelihood 
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VisuShrink SureShrink BayesShrink 

Fig. 5. Simulation Results. Panels from top to bottom, left to right: Peppers imag e, n = 256 X 256; noisy image, noise standard de viation a = 30, 
PSNR=18.6; original MDL method [23], PS NR=19.9; MD L with m odification of Sec. 1111- Al PSNR=23.9; MDL with modifications of Sees. Illl-Al and llirni 
PSNR=24.9; MDL with modifications of Secs. nTTAllnrBl and lnrcl PSNR=25.5; VisuShrink [8], PSNR=23.2; SureShrink [9], PSNR=24.6; BayesShrink [12], 
PSNR=25.9. (Higher PSNR is better). 



J13> becomes 

(15) 

The normalization constant C is also easier to evaluate by 
integrating the likelihood in terms of c": 



(16) 

where A is given by 

and the range of integration R is defined by requiring that 
the maximum likelihood estimators ( I14> are both within the 
interval [cr^jjn , Cmax] ■ will be seen that the integral diverges 



without these bounds. The integral factors in two parts involv- 
ing only the coefficients with 1^7 and 1^7 respectively. 
Furthermore, the resulting two integrals depend on the coeffi- 
cients only through the values 5'-y(c") and 5(c") — 5-y(c"), and 
thus, they can be expressed in terms of these two quantities 
as the integration variables - we denote them respectively by 
si and S2- The associated Riemannian volume elements are 

infinitesimally thin spherical shells (surfaces of balls); the first 

1/2 

one with dimension k and radius Si , the second one with 

1/2 

dimension n — k and radius 53 , given by 



T{k/2) 



■ dsi, 



^(n-fc)/2g(»-fc)/2- 

n{n-k)/2) 



dS2- 
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Fig. 6. Simulation Results. PSNR difference compared to the proposed method (MDL with modifications of Sees . IIII- AI IIII-B I and IIII-Cl . see Figs.l4landl5l 
Top ro w: B locks signal [9], sample size n = 2048. Bottom row: Peppers image, n = 256 X 256. Left panels show the effect of each of the three modifications 
in Sec. IIIII right panels show comparison to VisuShrink [8], SureShrink [9], and BayesShrink [12]. 



finally gives the NML density: 

r(A:/2)r((n-fc)/2) 



7r"/2(S'^(y"))fc/2(S'(jy«) - 5^(yn))(«-fc)/2 



Thus the integral in M6\ is equivalent to 

kai-^ r(fc/2) 

r((n-fc)/2) 

Both integrands become simply of the form l/x and hence, the corresponding code length becomes 

the value of the integral is given by 

k 

— In f„...,(i/'M = 



X In 



V2 / ^2 

In ^^-"^ 



r(fc/2)r((n-/c)/2) V cTi 



2 

mill 



(17) 



n — k 
I 

' n — k 



Plugging ini into ( I16> gives the value of the normalization 
constant 



inr - -inr , 
V2y V 2 

T) (1^ 

-ln7r + 21nln-2^. 
2 a ■ 



, (18) 



2 
mill 



(2e)"/2r(A:/2)r((n-fc)/2) V 
Normalizing the numerator ( I15t by C, and canceling like terms 



Applying Stirling's approximation 



In r(z) w (z — — jlnz — z+ — In 27r, 



DRAFT, JUNE 28, 2006 



10 



TABLE I 

Numerical Results. The peak-signal-to-noise ratio for various ID signals, denoising methods, and noise levels. Columns: noise 

STANDARD DEVIATION (j; PSNR FOR DIFFERENT METHODS (SEE FiGS .|4]aNd[5), BEST VALUE(S) IN BOLDFACE; SD: STANDARD DEVIATION OF ALL 

PSNR'S FOR EACH VALUE OF a OVER 15 REPETITIONS. 
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to the Gamma functions yields now 



ln/™,(y") « \ ln5,(y") + ^ ln(5(y") - 5^(y")) 



1 



In 



k - 1 



k 

n ~ k 



77 (T 

ln27r+ -In7r + 21nln^i2i. 
2 cr^,. 



Rearranging the terms gives the formula 

n-k^ S{y'') - S^{y^') 



ln/„m/(2/") 



k 



~k 



1 



+ - lnA:(n - fc) + co«if, (19) 



where const is a constant wrt. 7, given by 



conif = — In 27re — In 47r + 2 In In ■ 



Proof of Proposition The maximum likelihood parame- 
ters J14> may violate the restriction crl > cr^ that arises from 
the definition aj := + crj^. The restriction affects range of 
integration in Eq. Ml\ giving the non-constant terms as follows 



(n-fc)o-,^i„ 



S]^ ^ (is2 dsi 



sr'(lnsi -lnfcaf„iJdsi. (20) 



Using the integral / ^ Insi dsi = i(lnsi)^ gives then 
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where the first two terms can be written as 
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Combining with the third term of M\\ changes the plus into 
a minus and gives finally 



In ^21^ 



In 



which is exactly half of the integral in Eq. Mil , the constant 
terms being the same. Thus, the effect of the restriction on 
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TABLE II 

Numerical Results. The peak-signal-to-noise ratio for various images, denoising methods, and noise levels. Columns: noise 

STANDARD DEVIATION (j; PSNR FOR DIFFERENT METHODS (SEE FiGS .|4]aNd[5), BEST VALUE(S) IN BOLDFACE; SD: STANDARD DEVIATION OF ALL 

PSNR'S FOR EACH VALUE OF a OVER 15 REPETITIONS. 
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the code length where the logarithm of the integral is taken, 
is one bit, i.e.. In 2 nats. 

■ 

Proof of Eq.^ The relevant terms in the code length In 
i.e. those depending on k, for the index of the model class are 

- ln(fc!(n - fc)!) = - ln[fc(fc - l)!(n - k){n - A:)!] 

= - hi(fc(n - fc)) - Inr(fc) - lnr(n - fc), 

which gives after Stirling's approximation (ignoring constant 
terms) 

- ln(fc(n - fc)) - (^fc - hi fc + fc 

— — fc — ln(n — fc) + (n — fc) 

fc n — k 1 
= — — In fc" — ln(n — fc)" + — In fc(n — fc) + n. (22) 

Adding this to Eq. |6l (without the constant n) gives Eq. (|8j. 
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